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In this paper, we propose a class of Bayes estimators for the 
covariance matrix of graphical Gaussian models Markov with respect 
to a decomposable graph G. Working with the Wp^^ family defined 
by Letac and Massam [Ann. Statist. 35 (2007) 1278-1323] we derive 
closed-form expressions for Bayes estimators under the entropy and 
squared-error losses. The Wp^ family includes the classical inverse 
of the hyper inverse Wishart but has many more shape parameters, 
thus allowing for flexibility in differentially shrinking various parts of 
the covariance matrix. Moreover, using this family avoids recourse to 
MCMC, often infeasible in high-dimensional problems. We illustrate 
the performance of our estimators through a collection of numerical 
examples where we explore frequentist risk properties and the efficacy 
of graphs in the estimation of high-dimensional covariance structures. 

1. Introduction. In this paper we consider the problem of estimation of 
the covariance matrix S of an r-dimensional graphical Gaussian model. Since 
the work of Stein [35] the problem of estimating S is recognized as highly 
challenging. In recent years, the availability of high-throughput data from 
genomic, finance, marketing (among others) applications has pushed this 
problem to an extreme where, in many situations, the number of samples 
(n) is often much smaller than the number of parameters. When n <r the 
sample covariance matrix S is not positive definite but even when n> r, the 
eigenstructure tends to be systematically distorted unless r/n is extremely 
small (see [12, 35]). Numerous papers have explored better alternative esti- 
mators for S (or S~^) in both the frequentist and Bayesian frameworks (see 
[4, 8, 9, 15, 16, 17, 19, 25, 26, 29, 35, 37]). Many of these estimators give 
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substantial risk reductions compared to the sample covariance estimator S 
in small sample sizes. A common underlying property of many of these es- 
timators is that they are shrinkage estimators in the sense of James-Stein 
[19, 34]. In particular the Bayesian approach often yields estimators which 
"shrink" toward a structure associated with a prespecified prior. One of the 
first papers to exploit this idea is [4] which shows that if the prior used 
on is the standard conjugate, that is, a Wishart distribution, then for 
an appropriate choice of the shape (or shrinkage) and scale hyperparame- 
ters, the posterior mean for E is a linear combination of S and the prior 
mean (see Section 3.1). It is easy to show [see (3.16)] that the eigenvalues of 
such estimators are also shrinkage estimators of the eigenvalues of S. More 
recently, for high-dimensional complex datasets with r often larger than 
n, regularization methods have been proposed, which impose structure on 
the estimators through zeros in the covariance or the precision matrix (see 
[2, 18, 30]). The idea of imposing zeros in the precision matrix is not new, 
however, and was introduced in [12] in a pioneering paper on covariance 
selection models which are particular cases of graphical Gaussian models. 
Graphical Gaussian models have proven to be excellent tools for the analysis 
of complex high-dimensional data where dependencies between variables are 
expressed by means of a graph [3, 21]. 

In this paper we combine the regularization approach given by graphical 
models with the Bayesian approach of shrinking toward a structure. Through 
a decision-theoretic approach, we derive Bayes estimators of the covariance 
and precision matrices under certain priors and given loss functions, such 
that the precision matrix has a given pattern of zeros. Indeed, we work 
within the context of graphical Gaussian models Markov with respect to a 
decomposable graph G. Restricting ourselves to decomposable graphs allows 
us to use the family of inverse Wp^ Wishart distributions [27] as priors for 
S. This is a family of conjugate prior distributions for which includes 
the Wishart when G is complete (i.e., when the model is saturated) and 
the inverse of the hyper inverse Wishart, the current standard conjugate 
prior for when the model is Markov with respect to G decomposable. 
A potentially restrictive feature of the inverse of the hyper inverse Wishart 
(and the Wishart) is the fact that it has only one shape parameter. The 
family of Wp^ Wishart distributions considered here has three important 
characteristics. First, it has k + 1 shape parameters where k is the number 
of cliques in G. Second, it forms a conjugate family with an analytically 
explicit normalizing constant. Third, the Bayes estimators can be obtained 
in closed-form. 

In Section 2, we give some fundamentals of graphical models. In Section 
3, we recall the properties of the VFp^ family and its inverse, the IWp^, and 
we derive the mathematical objects needed for our estimators, that is, the 
explicit expression for the mean of the IW p^ . Parallel to the development 
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of the IWp^, we present in Section 4 a noninformative reference prior for 
T, (and the precision matrix 0). While offering an objective procedure that 
avoids the specification of hyperparameters, the reference prior also allows 
for closed-form posterior estimation as the posterior for S remarkably falls 
within the IWp^ family. In Section 5, we derive the Bayes estimator under 
two commonly used loss functions adapted to graphical models and the prior 
considered in Sections 3 and 4. Finally, in Sections 6 and 7 we compare the 
performance of our estimators in a series of high-dimensional examples. 

2. Preliminaries. Let G = {V, E) be an undirected graph with vertex set 
F = {1, . . . , r} and edge-set E. Vertices i and j are said to be neighbors in G 
if G E. Henceforth in this paper, we will assume that G is decomposable 
[24], where a perfect order of the cliques is available. For (Ci, . . . , C^) in a 
perfect order, we use the notation Hi = Ri = Ci while for j = 2, . . . , A; we 
write 

Hj = Ci U • • • U Cj, Rj = Cj\ Hj^i, ~ 1 ^^j- 

The Sj,j = 2, . . . ,k are the minimal separators of G. Some of these separators 
can be identical. We let k' <k — 1 denote the number of distinct separators 
and i^{S) denote the multiplicity of S, that is, the number of j such that 
Sj = S. Generally, we will denote by C the set of cliques of a graph G and 
by S its set of separators. 

An r-dimensional Gaussian model is said to be Markov with respect to 
G if for any edge not in E, the ith and jth variables are condition- 
ally independent given all the other variables. Such models are known as 
covariance selection models [12] or graphical Gaussian models (see [24, 36]). 
Without loss of generality, we can assume that these models have mean zero 
and are characterized by the parameter set Pq of positive definite precision 
(or inverse covariance) matrices such that Qij = whenever the edge (i, j) 
is not in E. Equivalently, if we denote by M the linear space of symmetric 
matrices of order r, by C M the cone of positive definite (abbreviated 
> 0) matrices, by Iq the linear space of symmetric incomplete matrices x 
with missing entries Xij, ^ E and by k: M Iq the projection of M 
into Ig, the parameter set of the Gaussian model can be described as the 
set of incomplete matrices S = K,{yi~^),^ £ Pq- Indeed it is easy to verify 
that the entries Sjj, (i, j) ^ E are such that 

(2-1) ^ij = ^i,V\{i,j}^V\{i,j}y\{ij}^y\{h3},V 

and are therefore not free parameters of the Gaussian models. We are there- 
fore led to consider the two cones 



(2.2) 
(2.3) 



PG = {y€M+|y,,=0,(i,i)^i?}, 
Qg = {x £Ig\xc, >0,i = l,...,A;}, 
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where Pq C Zq and Qg C /gj where Zq denotes the hnear space of sym- 
metric matrices with zero entries T/jj, (i, j) ^E. 
Grone et al. [14] proved the following: 

Proposition 2.1. When G is decomposable, for any x in Qg there 
exists a unique x in such that for all {i,j) in E we have Xij = Xij and 
such that x~^ is in Pq- 



This defines a bijection between Pq and Qg'- 

(2.4) ip:y= {x)~^ ePG^x = ip{y) = K{y^^) £ Qg, 

where k denotes the projection of M into Ig- 

If for any complete subset A (ZV, xa = {xij)i,jeA is a matrix and we 
denote by {xa)^ = {xij)i^j^y the matrix such that Xij = for (i, j) ^ Ax A, 
then the explicit expression of x~^ is 

(2.5) y = x-' = Y^ (x^y - J2 l^{S){Xs'f. 

c&c ses 
For (x,y) G Ig x Zg, we define the notion of trace as follows: 

(2.6) tY{xy) = {x,y) = ^ Xijyij. 

Note that for x G Qg and y G Pg, {x,y) =tv{xy), where ti{xy) is defined in 
the classical way. In the sequel, we will also need the following. If, for y € Pg 
we write y = with a G Qg, we have, for x G Qg, the two formulas 

(2.7) {x,a-') = Y,{xc,ac') - E K5)(x5, a^i), 

Cec ses 

(2.8) detx- ^CecidetxG) 



Usesi^^txsr^^y 

The graphical Gaussian model Markov with respect to G is therefore the 
family of distributions 

Mg = {Nr{0, S), S G Qg} = {iVr(0, S), J] = G Pg}. 

In this paper, we will study various estimators of S G Qg and 0, G Pg- We 
will write mle and mlcg for "maximum likelihood estimate" in the satu- 
rated model and in the graphical model, respectively. Also, in this paper, 
we will use the general symbol 6 to denote an estimator of 6 rather than 
the traditional 6 as the notation 6 has been reserved for the completion 
process (see Proposition 2.1). The mlcg, Qg, for the parameter Q G Pg in 
A/g is well known (see [24], page 138). If Zj, i = 1, . . . , n is a sample from the 
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Nr{0, H) distribution in Mg, if we write U = X;r=i ^i^i and 5 = f and if 
n > maxcgc|C|; then Qg exists and is equal to 

(2.9) ng = J2iSc'f-Y.''iS){Ss'f, 

C&C S£S 

where clearly S" as a subscript or 5 in i^(«S') refers to the separator while the 
remaining S's refer to the sample covariance matrix. If we assume that the 
graph is saturated, then, clearly the mle is Cl = S~^. 

Finally, we need to recall some standard notation for various block sub- 
matrices: for X € Qg, xCj-,3 = 1, . . . , /c are well defined and for j = 2, . . . , fc, 
it will be convenient to use the following: 

(2.10) _ 

X[j^=XR^, Xy].=Xy]-2;y)X^^.^X(j-], 

where xi^j^ £ M+,a:;y]. G M^_^_^.,xy) G L{M^i~^3 the set of linear ap- 

plications from M^J"''^ to M'^J. We will also use the notation x^i2) and x^^. 
for 

X[12) = XCl\S2,S2^S2 ^[1]- = ^Cl\S2-S2 = ^Cl\S2 - X Cl\S2, 82^32 ^S2,Cl\S2- 

3. Flexible conjugate priors for S and ft. When the Gaussian model 
is saturated, that is, G is complete, the conjugate prior for as defined 
by Diaconis and Ylvisaker [13] (henceforth abbreviated DY) is the Wishart 
distribution. The induced prior for S is then the inverse Wishart IWr{p,0) 
with density 

(3.1) IWr(p, e- dx) = -^|xrP-('^+^)/2 exp -(0, x~^)1m+ (x) dx, 

where p > is the shape parameter, Tj.{p) the multivariate gamma func- 
tion (as given on page 61 of [31]) and 9 G M"*" is the scale parameter. 

As we have seen in the previous section, when G is not complete, 
is no longer the parameter set for E or the parameter set for Q. The DY 
conjugate prior for S G Qg was derived by [11] and called the hyper inverse 
Wishart (HIW). The induced prior for G Pg was derived by [32] and we 
will call it the G- Wishart. The G- Wishart and the hyper inverse Wishart 
are certainly defined on the right cones but they essentially have the same 
type of parametrization as the Wishart with a scale parameter 6 G Qg and 
a one-dimensional shape parameter 6. 
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3.1. The distribution and its inverse. Letac and Massam [27] intro- 
duced a new family of conjugate priors for O € Pq with a A: + 1-dimensional 
shape parameter thus leading to a richer family of priors for fi, and there- 
fore for S = fi;(Q~^) € Qg through the induced prior. It is called the type 
II Wishart family. Here, we prefer to call it the family of W^p^-Wishart 
distributions in order to emphasize that it is defined on Pq- Details of this 
distribution can be found in Section 3 of [27]. We will first recall here some of 
its main features and then derive some new properties we shall need later in 
this paper. Let a and (3 be two real-valued functions on the collection C and 
S of cliques and separators, respectively, such that a(Q) = ai,(3{Sj) = (ij 
with f5i = Pj if Si = Sj. Let Ci = \Ci\ and Sj = \Si\ denote the cardinality 
of Ci and Si, respectively. The family of Wp^ -Wishart distributions is the 
natural exponential family generated by the measure HQ{a, (3 ,ip{y))vG{dy) 
on Pq where (p{y) is as defined in (2.4) and where, for x e Qg, 

(3.2) HGia,P;.)- Hccldet. .)"(-) 



n5e5(det:E5)'^(^)/5(^)' 

(3.3) VGidy) = Hg{\{c + 1), \{s + l)-^{y))lp^{y) dy. 

The parameters {a, (5) are in the set B such that the normalizing constant is 
finite for all 6 G Qg and such that it factorizes into the product of HG{a, j3; 9) 
and a function Tu{o.,l3) of {a,j3) only, given below in (3.5). 

The set B is not known completely but we know that B 5 Up Bp where, 
if, for each perfect order P of the cliques of G, we write J{P, S) = {j = 
2, . . . , k\Sj = S}, then Bp is the set of (a, j3) such that: 

1- Z]jej(p,5)(c'^j + ~ Sj)) ~ ^{'S)f3{S) = 0, for all S different from 5*2; 

2. —aq — ^(cq — — 1) > for all g = 2, . . . , and —ai — ^(ci — S2 — 1) > 0; 

3. -ai - ^(ci -S2 + 1) -72 > where 72 = EjGj(p,52)(«i ~ + ^^^)- 

As can be seen from the conditions above, the parameters P{S),S G S 
are linked to the q(C),C€C by k' — 1 linear equalities and various linear 
inequalities and therefore B contains the set Bp dimension at least k + 1, 
for each perfect order P. We can now give the formal definition of the Wp^ 
family. 

Definition 3.1. For {a, (3) G B, the VFp^-Wishart family of distribu- 
tions is the family T(^a,(3),PG ~ {^Pg(^^ f^'^'^^v)^^ ^ Qg} where 

,3.4) ^,„,„,,.M.)^.-<»*>f^;^|||^.c(..) 

and 
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(3.5) 



Cl - S2 



■72 



-ai)nrc,-s,(-a,). 



i=2 



We can also, of course, define the inverse Wpg(Q,/3,0) distribution as fol- 
lows. If y ~ Wpg(a,/3,6'), then X = ip{Y) IWp^ia, (3,9) with distribution 
on Qq given by (see (3.8) in [27]) 



(3.6) 
(3.7) 



IWp^{a,(3,9;dx) 



Tn{a,p)HG{a,P;ey 
where ^icidx) = Hg{-1{c+ l),-^(s + 1); x)Iqq(x) dx. 
The hyper inverse Wishart is a special case of the IWp^ distribution for 

d + a-l 

ai = - 

(3.8) 

(3i = - 



2 

(5 + Si 



1, . . . , /c. 



2, . . . , /c, 



which are all functions of the same one-dimensional parameter 5. It is tra- 
ditional to denote the hyper inverse Wishart, that is, this particular IWp^, 
as the HIW{5,9) and this is the notation we will use in Section 6. 

Corollary 4.1 of [27] states that the IWp^ is a family of conjugate distri- 
butions for the scale parameter S in Afc] more precisely, we have: 



Proposition 3.1. Let G be decomposable and let P be a perfect order 
of its cliques. Let {Zi, . . . , Zn) be a sample from the Nr{0,T,) distribution 
with T, G Qg- If the prior distribution on 2S is IW p^{a, (3,6) with {a, (3) G 
Bp and 9 S Qg, the posterior distribution of 2S, given nS = J2i'=i ^i^f, 
is IWp^i^a — §, /? — f 1 ^ + i^in-S)), where a — ^ = (ai — ^, . . . , — ^) and 
/?-§ = (/?2-§, ■ • ■ are such that (a-§,/3-§) e Bp and9 + K{nS) G 

Qg so that the posterior distribution is well defined. Equivalently, we may 
say that if the prior distribution on is WpQ{a, (3,9) , then the posterior 
distribution of is Wp(j{a — ^, P — ^,9 + K{nS)) . 

For the expression of the Bayes estimators we will give in Section 5, we 
need to know the explicit expression of the posterior mean of and S when 
the prior on Q is the Wp^ or equivalently when the prior on S is the IW . 
The mean of the Wp^ can be immediately obtained by differentiation of the 
cumulant generating function since the Wp^ family is a natural exponential 
family. Prom (3.4), from Corollary 3.1 and from (4.25) in [27], we easily 
obtain the posterior mean for $7 = S"^ as follows. 
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Proposition 3.2. Let S and il. be as in Corollary 3.1; then the posterior 
mean of given nS, is 



E{n\S) = -2 



(3.9) 



1x0 



1\0 



i=2 



Since the IWp^ is not an exponential family, its expected value is not as 
straightforward to derive. It is given in the following theorem. 

Theorem 3.1. Let X be a random variable on Qg such that X ~ 
IW p^{a,[3,9) with (a,/3) G Bp and 6 £ Qg; then E{X) is given by (3.10)- 
(3.14): 



(3.10) 



(2). 



(3.11) E{xc,\s, 



tS2 J 



H2) 



-(ai + ((ci - 82) /2) + 72) - ((S2 + l)/2) 



-(ai + ((ci + l)/2)+72)' 

^Ci\52,52 

-(ai + ((ci + l)/2)+72)^ 



(3.12) 



\S2 



(Ql + ((ci-S2 + l)/2)) 
S2 



X 1 



+ 



2(ai + ((ci + l)/2)+72) 



^Cl\S2,S2^{2)^S2,Cl\S2 



-(ai + ((ci + l)/2)+72)' 
and for j = 2, . . . ,k 

(3.13) E{x[^)) = i?((x[,)X^.;))i?(x(,-)) = ey^9^]E{x^^)), 



E{xy]) 



(3.14) 



-{aj + iicj-Sj + l)/2)) 



(l + itr(e^.;E(x(,)))) 



The proof is rather long and technical and given in the Appendix. Let us 
note here that (3.10)-(3.14) can also be written in a closed-form expression 
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of the Choleski type, that is, E{X) = T^DT with T lower triangular and D 
diagonal, where the shape parameters (a,/3) are solely contained in D. We 
do not give it here for the sake of brevity. 

The important consequence of this theorem is that, from (3.10) to (3.14), 
we can rebuild E[X) G Qg- Indeed, by definition of Qg,E{X) is made up 
first of E(Xc-i^) which is given by (3.10), (3.11), and its transpose and (3.12) 
and then, successively, of the jth "layer": E{X[j'^) and its transpose, and 
E{Xy^), for each j = 2, . . . ,k. These are immediately obtained from (3.13) 
and (3.14) since, by definition, Sj C Hj-i and therefore the quantity E[X(^j^) 
is a sub-block of E[XHj_^) and has therefore already been obtained in the 
first J — 1 steps. We can therefore now deduce the posterior mean of S when 
the prior is /l/Fp^ (a, /3, 0). 

Corollary 3.1. Let S and E he as in Corollary 3.1; then the posterior 
mean for E when the prior distribution on 2E is IW PQ{a, /3,9) is given by 
(3.10)-(3.14) where X is replaced by 2E, 9 is replaced by 9 + K{nS) and 
{ai,Pi) 's are replaced by ai — ^,(3i — ^ 's. 

3.2. Shrinkage by layers and the choice of the scale parameter 9. When 
we use the /VKp^ (a, /3, ^) as a prior distribution for the scale parameter 
S, we have to make a choice for the shape hyperparameters (a,/3) and the 
scale hyperparameter 9. When G is complete, the IWp^ {a, 13, 9) becomes the 
regular inverse Wishart IW{p, 9) as given in (3.1). When G is decomposable 
and one uses the hyper inverse Wishart HIW {5,9), in the absence of prior 
information, it is traditional to take 9 to be equal to the identity or a multiple 
of the identity and 5 small, such as 3, for example (see [21]). 

The scale parameter, however, can play an important role if we have some 
prior knowledge on the structure of the covariance matrix (see [4]) and we 
are interested in "shrinking" the posterior mean of E toward a given target. 

In the saturated case, for a sample of size n from the A^(0, E) distribution 
with a Wishart W{^, {v'D)~^) prior on $7 = E~^, the posterior mean of E is 

(3.15) EinS) "^ + "^ 



u + n — r — 1 

First, we note that when n is held fixed and v is allowed to grow, the 
posterior mean tends toward D while if is held fixed and n is allowed to 
grow, the estimator tends toward S. Next, let us consider the eigenvalues of 
the posterior mean. If we take D = 11 where / is the average of the eigenvalues 
li,...,lr of the mle S, then it is easy to see that the eigenvalues gi, i = 
l,...,r of ^(E|5) are 
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nearly a weighted average of / and li. Some simple algebra will show that 
for li <l we always have li < gi and that for i such that U > I, that is, for 
Cj = J > 1, we will have gi < li whenever v > -^^{r + 1). Since in order for 
the prior to be proper, we must have that v > r — l,we see that this condition 
is very weak as long as is close to 1. When the condition > r — 1 is 

satisfied, the eigenvalues of the posterior mean are shrunk toward / and the 
span of the eigenvalues of -©(SIS') is smaller than the span of the eigenvalues 
of 5, which generally can be used to correct the instability of S. (We note 
that if Cj = ^ is sufficiently large, ^^^^ sufficiently close to 1 and if 
J is close to 1, then there is really no need to shrink the eigenvalues.) 

In Section 5 we show that our Bayes estimators can be expressed in terms 
of the posterior mean of S and with the IW and the Wp^ , respectively, 
as priors. One would like to be able to prove properties for the eigenvalues 
of our estimators similar to those of the posterior mean under the Wishart 
in the saturated case. This is beyond the scope of this paper. However, we 
observe in the numerical examples given in Sections 6 and 7 that the eigen- 
values of our estimators do have shrinkage properties. With this motivation, 
in Sections 6 and 7, we will use IW p^ priors with 6 so that the prior mean of 
T, is the identity as well as with 6 equal to the identity. Thus, we first derive 
d so that E{T,) = ^E{IW p^{a, P,6)) = I and then, we will argue that our 
estimators can be viewed as shrinkage estimators in the sense of shrinkage 
toward structure. 



Lemma 3.1. Let S G Qc be such that 2S ~ IWp^{a,f3,6) for given 
(a,/3) € A. In order to have E{T,) = I it is sufficient to choose 6 as a diag- 
onal matrix with diagonal elements equal to 

V 2 A 2(ai + ((ci + l)/2) + 72) 

eii = -2(^ai + ^^^+^2^-{s2 + l) for I G {2), 

On = -2 (a, + 2rill±l^ [i + ltr(0^-;i?(xo))))~' 

/orZe[i],J=2,...,fc. 

The proof is immediate from (3.10)-(3.14). 

Let us now argue that one of our estimators (to be derived in Section 

~Wp 

5), equal to the inverse of the completion of the posterior mean 

E{Ti\S) of S when S ~ /14^Pg(a,/3,0), can be viewed as a shrinkage esti- 
mator. It follows from Theorem 4.4 of [27] that, when the prior on S is 
the /W^Pq (a, /?, 0), /M^c.-^. (— aj, 6'[j].) as defined in (3.1). Then, since 



for I e [II 
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uSh, ~ W\H^\{^,T.H,) and thus nS[i]. ~ Wci-si{^^^,^[{\.), through an ar- 
gument parahel to the one for (3.15), it fohows that the posterior mean 
E{T,[^.\S) is a Unear combination of S[ij. and with Qj playing a role 
parallel to that of u in (3.15), and is therefore a shrinkage estimator of ^[i].- 
Thus, we can shrink with different intensities various parts of the matrix S. 
The posterior mean i?(5]|S') is reconstructed, layer by layer, as can be seen 
in the proof of Theorem 3.1 using i?(S[j].|S') as a building block [see (A. 4)] 
through what we might call a conditional Choleski reconstruction. The re- 
sulting estimate can therefore be regarded as a shrinkage estimator. 
A similar argument can be made for all our Bayes estimators. 



4. Reference prior. In this section, we derive a reference prior for T, and 
therefore Q (see [1, 5, 6]). This is done first by reparametrizing the density 
of the mle k{S) of S with a parametrization naturally induced by a given 
perfect order P of cliques. The parameters are, in fact, the elements of the 
Choleski decomposition of S for the order of the vertices given by P. As we 
will see below, the density of the mle of S belongs to a natural exponential 
family and we will therefore follow the method given by Datta and Ghosh 
[10] and later used by Consonni and Veronese [7] in the context of general 
Wishart distributions, to derive the reference prior for the new parameter. 
We will then consider the induced prior on S. It was shown in [27], equations 
(4.14)-(4.18), that if, for X = K{nS) and a given perfect order of the cliques 
P, we make the following change of variable: 

X^^= (X[i]., X[i2) , X(2) , X[j]., X[j)X^.|, j = 2, . . . , fc), 

then the density of the new variable H is 

I -1|+(S2/2)| |{ci-S2)/2 
(4.1) X (e-<("ll2> — [12) (-[1,2> -[1,2) )-(2) > ^^^^^^ ) ^^^^^ 

i=2 
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-1 ^ 

i=2 

and the new parametrization replacing a, in the order induced by the order 
of the new variables, is clearly 

(4.2) (o-(2), (<Tf^pCr[i2)), (cJypCJy)0-^.J),i = 2,...,k). 

We now derive the reference prior for (p and the induced prior for a. 

Theorem 4.1. Consider the scale parameter S € Qg for the Gaussian 
model Mg- Let a = 211 and let (p be the ordered parameter as defined in 
The reference prior for (p is independent of the order of the components and 
has density equal to 

(4.3) = nkbl.l"'^"''^/'^-^^ 
Moreover the induced reference prior for the parameter a € Qg is 

|^^J-{c.+l)/2n^^^| |-{c,+l)/2 



(4.4) 7r^(a) 



|0-^^|({ci+C2)/2)-S2-({s2+l)/2) J]fe^3|o-_5j((c,-s,)/2)-({s,+l)/2) 

which corresponds to an improper /VKp^ (a, 0) distribution with 
aj = 0, j = l,...,k, 

(4.5) 

Cl + C2 Cj -Sj . 

P2 = — S2, [3j= ' % J = 3,...,fc. 



The proof of the theorem is given in the Appendix. Let us note here that 
the fact that the induced prior on a is an IWp^, albeit an improper one, is 
not too surprising since (see Theorem 4.1 of [27]) the IW is a conjugate 
distribution for the scale parameter of the distribution of K{nS). 

Because the distribution (4.4) has the form of an IWp^, its posterior 
given U = nS is an IW with parameters 
n 

aj = 0--, j = l,...,k, 

(4.6) 

ci + C2 n n Cj-Sj n . 

/?2 = — ^ = 2 2' J = 3,---,fc 

and 

e = K{nS). 
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Of course, K{ns) S Qg and it is easy to check that {a,f3) G B, that is, the 
posterior distribution is a proper IWp^. As in Section 3, we now need to 
compute the exphcit expression for E{Q\S) and EC^IS) when the prior 
distribution on o" = 2S is the objective prior (4.4). From Proposition 3.2 
and (3.9), we immediately obtain the following. 

Corollary 4.1. Let Zi,i = 1,. . . ,n be a sample from the N{0, T,) dis- 
tribution with S € Qg- Let U = ^27=1 ^i^f ^'^^ the prior distribution on 
2S be as in (4-4)- Then the posterior mean of ^ = is 



,=1 'V n J 

(4.7) 



k _ 



It is interesting to note here that when n tends to +oo, the expression of 
the posterior mean in (4.7) becomes very close to the expression of the mleg 
of $7 as given in (2.9). This will also be illustrated by our numerical results 
further in this paper. 

From Theorem 3.1, we immediately derive the posterior mean of S as 
follows. 



Corollary 4.2. Let U and S as above; then E{2Tj\S) is given by 
(3.10)-(3.14) for 6 = 7r{nS) and {a, (3) as in (4-6) with the additional con- 
dition that {a, (3) in (4-6) satisfy the inequalities 

Cl + 1 Cl — So + 1 
ai + + 72 < 0, ai + < 0, 

Ci — So • + 1 
aj+ ' ' <0, j = 2,...,k. 
2 

We note here that the additional conditions imposed on the posterior 
hyperparameters are there to insure that the moments given in (3.10)-(3.14) 
exist. 



5. Decision-theoretic results. In this section, we will derive the Bayes 
estimators for S and 0, under two loss functions similar to the classical Li 
and L2 loss functions, but adapted to Qg and Pg- These Bayes estimators 
are of course computed with respect to a given prior distribution. The prior 
distributions we will consider are the HLW and the IW as recalled in 
Section 3, and the reference prior as developed in Section 4. 
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5.1. Bayes estimators for S and $7. We now proceed to place the covari- 
ance estimation problem in graphical models in a decision-theoretic frame- 
work. Let us first recall what is traditionally done in the saturated case, that 
is, when G is complete. Given a sample of size n from a A'r(0, S) distribu- 
tion, letting S be any estimator of S based on that sample, we consider the 
following two loss functions: 

Li(S, S) = (S, S-^) - log V ^ 

(5-1) 

L2(S,S) = (S-S,S-S), 

called Stein's (or entropy, or likelihood [19, 34]) and squared-error (or Frobe- 
nius [25, 28]) losses, respectively. Other losses have also been considered in 
the literature (see [33] for details). Many authors such as Haff [15], Krish- 
namoorthy [23] and Krishnamoorthy and Gupta [22] have also considered 
the estimation of the precision matrix Q, = S^^, instead of S. The reader 
is referred to [37] for a more complete list. The natural analogues for $7 of 

(5.1) are 

Li(J7,J7) = -log|ilO"i| -r, 

(5.2) ... . ^ 

L2{n, n) = {n-n,n-n) = tr(f] - nf. 

A question that naturally arises in various contexts in multivariate analysis 
and related topics is whether to estimate S or its inverse = We choose 
to focus on the estimation of both S and in this paper for a variety of 
reasons. The parameter S has a natural and well-understood interpretation 
in multivariate analysis and its direct estimation has numerous applications. 
The precision matrix, on the other hand, has a natural and central place in 
Gaussian graphical models as it is the canonical parameter of the natural 
exponential family JVg and it sits in the parameter set Pq as defined in 
(2.2), a parameter set of dimension much smaller than that of . 

To our knowledge, in the case where G is decomposable and not complete, 
a decision-theoretic estimation of the scale parameters S or has not been 
previously considered. We will do so now. We first observe that the tradi- 
tional loss functions used for saturated models need to be reconsidered for 
graphical models as we now have fixed zeros in the inverse covariance matrix 
and therefore fewer parameters. This is clear in the expression of L2(f^,f^) 
in (5.2) when O is in Pq- Indeed, we have 

dependent not on r(r -|- l)/2 parameters but on the nonzero parameters 
only. Similarly, because of the structural zeros of O G Pq, Li (0,(7) in (5.2) 
depends only on the nonzero elements of 0. We also note that, according 
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to (2.7) and (2.8), for an arbitrary decomposable graph G, when S and S 
both belong to Qg, (5-1) can be written as Li(S,S) = Li(S,S), where 

Li(S, S) = 5] (Sc, S^^) - E (^5' ^5') 

(5-3) 

ricGC l^cl rises l^sl 
which involves solely the elements of S G Qg and S G and not the nonfree 

elements of their completions S and S. Accordingly, we shall modify the 
traditional L2 loss function for S G as follows so that, like Li(S,S), it 

depends only on the free parameters of S. For S and S in we define 

(5.4) L2(S,S) = (S-S,S-S)= ^ {%j-T.i,f. 

We have therefore modified, when necessary, the traditional loss functions 
given by (5.1)-(5.2) to take into account the graphical nature of the covari- 
ance matrix. We now derive the corresponding Bayes estimators for these 
newly defined loss functions and given priors. 

Proposition 5.1. Let Zi,i = l,...,n and U = nS be as in Corollary 
3.1. Then, for a given prior vr(S) on S G Qg, the Bayes estimators of S 
under (5.3) and (5.4) are equal to, respectively, 

(5.5) i:l'^P''^=K{[E-^^\^\t-')r') and Sjf = i?-(^l^)(S), 

where 7r(S|C/) denotes the posterior distribution of T, given U. 

For a given prior 7r(r2) for Q G Pg, the Bayes estimators of Q under the 
loss functions in (5.2) are equal to, respectively, 

(5.6) Ojfl^) = [^-(^1^0(^0-1))]'^ and O^f = ^"(^l^)(0). 

Proof. We first derive the expression of S^^^^^'^^ in (5.5). The Bayes 
estimator is the estimator S that minimizes the posterior expected loss. So 
for Li loss, we have 

£;"(^l^)[Li(S,S)] = j -log|SS-V^]7r(S|f/) 

= (S,^'^(^l^)[S-i]) -log|S| -^^(^l^)[log|S-^[] -r 

= (S,i?(C/)) -log|S| -c 

for some constant c and R{U) = E'^^^^^'^[T,~^]. Minimizing the posterior ex- 
pected loss is equivalent to maximizing the function log — (S,i?(C/)) with 
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respect to S. This function is concave and is maximized at S = R{U) ^ = 
|-^7r(s|(/) p-ijj-i jjgj^gg yielding the Bayes estimator of S for Li loss. 

Let us now derive the expression of S^2^'^^ in (5.5). Once more the L2 
Bayes estimator is found by minimizing the posterior expected loss. Now, 

Rl, [S, S] = / ^ - Si,)V(S|C/) dS 

Since we are minimizing a sum of terms, it is sufficient to minimize each one 
of the terms. It is well known (see [15]) that each one of these is minimized 
for = £"^^^l^^(Sjj), which gives us the desired expression for S^^^'^''. 
The proofs for the Bayes estimators of follow along similar lines. □ 

We derived the expression of Sj^^^'^^ above, even though it is straight- 
forward, in order to emphasize the fact that, unlike in the classical case 
when one estimates a complete covariance matrix, the posterior means are 
Bayes optimal only if the L2 loss function is modified to reflect the graph 
that underlies S. In other words the posterior mean will not be the Bayes 
estimator if nonfree elements of the matrix S contribute to the loss, a point 
that can be easily overlooked when considering decision-theoretic estimation 
for graphical models using the traditional loss functions. 

It is important to note here, since it will simplify many of our computa- 
tions in Sections 6 and 7, that the Li estimator for S is the ip transformation 
of the L2 estimator for 0,. A similar result holds for the L2 estimator for S, 
that is. 

The risk functions corresponding to the losses above are 

RL,{tL^) = E[uitL,,j:)], RL,{nL,) = E[u{nL,,n)], i = i,2. 

In the subsequent sections, these risk functions will be used to assess the 
quality of the eight estimators that we consider. For each of S and fi, the 
eight estimators considered will be the sample covariance matrix 5 (if ig- 
noring the graphical model structure) and its inverse, the mleg for T, and 
$7, Tig and ilg, and 

i=l,2 and i = l,2, 

where the prior tt will be a HIW or more generally an IW p^, or the reference 
prior. 
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5.2. Risk properties of Tig. We now proceed to state a decision-theoretic 
property of the maximum hkehhood estimator of S G Qg- From (2.9), it 
fohows immediately that 



i=2 



Lemma 5.1. The maximum likelihood estimator T,g is the best Li esti- 
mator in the class of estimators of the form aS,a G M where 



(5.7) I 

andU = j:'l^^ZiZl 



Lj=l j=2 



Proof. Recall that for the estimator aS under Li loss we have 

i2i(aS,S) = £;[(S^\aI]) -log|S-^aS| -r] 

= a(S~\^[S]) -rlog(a) -S[log|S-^S|] -r. 

Now since E[T,] = nT, (see [24], page 133) we have that a(S~^,£'[S]) = nra. 
Moreover, by (2.8), 



log 



detS 
detS 



E 



log 



ricec l^cl rises \^s\ 



E 



^ ,^~{l/2)y. V-(l/2)| 



Since S^^^^'^^EcS^ 
(see [31]), 



(1/2) 



-(1/2) 



-(1/2), 



~ Wcin,Ic) and E^^'/'^S^S^^'^'^ ~ Ws{n,Is) 



■■ Y[xl-j+i, 

i=i 



-(1/2) 



5^5 



,2 



i=i 



Letting i?[log |5]~^S|] =m which is a constant independent of S or a, we 
therefore now have Ri{a'E, S) = anr — rloga — m — r. Differentiating with 
respect to a and setting the derivative to zero gives a = 1/n, which proves 
the lemma. □ 
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6. Risk comparisons and numerical properties. In this section, through 
two examples, we investigate the performance of our Bayes estimators de- 
rived from the different priors presented in Sections 3 and 4. We base our 
comparisons on frequentist risk calculations obtained from simulations un- 
der losses Li and L2 for both S and 0,, on predictive properties and on 
eigenvalues properties. 

6.1. Example 1: Two Cliques." In this example, we illustrate the power 
and flexibility of the IWp^ family and its multiple shape parameters. We 
build a general example based on the call center data analyzed in [18] and 
described in Section 7. First, we define a graph G with 100 vertices (r = 100) 
where Ci = {1, . . . , 70}, C2 = {61, . . . , 100} and ^2 = {61, . . . , 70}. The true 
covariance matrix S is constructed by (i) taking the sample covariance of 
the first 100 variables in the call center data, (ii) removing the ij entries 
corresponding to which are not edges of G and (iii) performing the 

completion operation described in (2.1). This procedure guarantees that S 
preserves the conditional independence relationships specified in G. This 
example involves cliques of different dimensions (ci = 70 and C2 = 40) and 
our goal is to show that it is possible to obtain improved estimators by 
using an IWp^ prior with multiple shape parameters. The idea here is to 
apply different levels of shrinkage for each clique, that is, more shrinkage 
for larger cliques, following the intuition that more shrinkage is necessary in 
higher-dimensional problems. 

Our simulations compare estimators based on the reference prior, the 
traditional hyper inverse Wishart (one shape parameter, 5 = 3) and five 
versions of the IWp^. In the latter, we first choose the shape parameters in 
proportion to clique size. More specifically, we choose of the form 



with 6i equal to ^Ci, jCi and jqCi. In addition, we also compare risk obtained 
from empirical Bayes estimates of shape parameters. These were computed 
based on the following specifications of a and /? as a function of clique size: 

(i) ai = -(^^^p-^) and ft = zi^^^, 

(ii) Oi = aci + b and ft = asj + b. 

Notice that in order to obtain the empirical Bayes estimates of a and (3 
we maximize the marginal likelihood as a function of 6 in case (i) whereas 
in (ii) the maximization is done as a function of both a and b. For all 
choices of shape parameters, we define the scale matrix in two ways: the 
identity matrix I and a matrix D for which the prior expected value of 
S G Qg is /. Conditional on a shape parameter, D can be easily derived as 
we saw in Lemma 3.1. This example focuses on Li loss and the impact of 
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Table 1 
Two cliques: risk for estimators 





n = 


75 


n = 


100 


n = 


500 


n = 


1000 


Li(S) 




Li(E) Lr{n) ii(S) 


Reference 


212.7 


66.61 


60.71 


40.93 


7.02 


6.66 


3.33 


3.28 


HIWi3,I) 


98.76 


59.28 


80.72 


43.41 


7.76 


7.18 


3.54 


3.43 


Empirical Bayes (i) 


127.47 


58.43 


78.05 


42.87 


7.70 


7.13 


3.52 


3.41 


Empirical Bayes (i)-D 


27.44 


25.84 


23.74 


21.95 


6.31 


6.02 


3.20 


3.12 


Empirical Bayes (ii) 


121.55 


57.12 


74.45 


41.81 


7.60 


7.05 


3.51 


3.39 


Empirical Bayes (ii)-D 


24.78 


23.04 


21.48 


19.95 


6.12 


5.86 


3.16 


3.08 


IWpail/2c,,D) 


29.99 


25.18 


24.53 


24.49 


6.37 


6.21 


3.27 


3.22 


IWp^{\/2c,,I) 


207.4 


67.88 


116.7 


49.78 


8.69 


7.80 


3.76 


3.61 


IWp^[l/Ac,,D) 


22.18 


17.96 


18.57 


15.87 


5.67 


5.43 


3.03 


2.96 


IWp^[l/Ac,,I) 


165.5 


63.10 


96.14 


46.20 


8.14 


7.43 


3.67 


3.50 


IWp^{l/lQci,D) 


35.71 


31.99 


31.59 


27.02 


6.77 


6.41 


3.32 


3.23 


IWp^[l/lOciJ) 


141.7 


60.23 


89.67 


45.03 


7.98 


7.32 


3.59 


3.47 


MLEg 


813.9 


70.72 


154.6 


43.51 


8.13 


6.79 


3.62 


3.32 


MLE 






7.3 X 10* 


102.5 


14.45 


10.85 


6.00 


5.22 



flexible priors in estimating the eigenstructure of S and Q. The results in 
Table 1 show that appropriate choices of different shape parameters have a 
significant effect in reducing the risk of estimators. Looking at Li(r2), and 
n = 75, for the estimator under IW Pf.{jCi, D) there is approximately a 78% 
(76% for n = 100, 20% for n = 500 and 9% for n = 1000) reduction in risk 
when compared with the more traditional HIW{3,I). When comparing to 
the constrained maximum likelihood estimator (mleg) the reduction is even 
more impressive: 97% for n = 75, 87% for n = 100, 30% for n = 500 and 
16% for n = 1000. Additionally the reference prior performs well and always 
beats the mlcg. 

We emphasize the connection between Li and the eigenstructure of Q and 
S by the scree plots in Figure 1. It is our belief that the superior performance 
of the Bayesian estimators under the IW Pq and the reference prior relative 
to the mleg is a direct consequence of the better estimation of the eigenvalues 
of both the precision and covariance matrices. 

This experiment also indicates that choosing the amount of shrinkage as 
a function of shape parameters can be a delicate task. Here, we find that 

= \ci (i.e., 5i = 17.5 and 82 = 10) performs best and seems to be a good 
compromise between 5i = and 6i = ^Ci. The definition of appropriate 
hyperparameters and the choice of shrinkage level is context specific and 
depends on the amount of prior information available. An alternative to 
the subjective specification of the hyperparameters is the empirical Bayes 
approach presented. The results in Table 1 show that this alternative per- 
forms reasonably well and uniformly outperforms the reference prior (the 
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Top eigenvalues of fi Bottom eigenvalues of Q 




Top eigenvalues of S 
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Bottom eigenvalues of E 




s6 se 100 



Fig. 1. Scree plots of eigenvalues for (top row) and E (bottom row) in the simulation 
with n = 100. For each estimator, the lines represent the average of the eigenvalues after 
1000 simulations. "Best" refers to the estimator with lowest risk (IW p^{l/4:Ci, D)). 



other objective alternative). As we show in this example, the IWp^ offers 
a very general framework for the incorporation of prior knowledge with the 
ability to significantly improve the performance of posterior estimators and 
estimation of covariance matrices in general. 



6.2. Example 3: choosing the graph. Our second example demonstrates 
the potential of graphical models as a model-based tool for regularization 
and estimation of large covariance structures. So far, we have presented 
examples that compare the performance of different estimators (based on 
different priors) assuming knowledge of G. In real problems, G is unknown 
and often has to be inferred before parameter estimation. From a Bayesian 
perspective, model selection involves the exploration of the posterior distri- 
bution of graphs given by 

(6.1) piG\X)^p{X\G)p{G), 

where p{X\G) is the marginal likelihood of G and p{G) represents its prior. 
In Gaussian graphical models the marginal likelihood for any G is given by 
the following integral: 

(6.2) P{X\G)= f /(X|S,G)^(S|G)dS, 
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where f{X\T,, G) is the density of X given S and G and tt(Y,\G) is the prior 
distribution for S given G. Using IWp^ conjugate priors for E makes the 
computation of the above integral straightforward since the expression of the 
marginal likelihood is obtained explicitly through the normalizing constant 
of the IW Pq as given in (3.4). If we assume a uniform prior over the graphs, 
computing the posterior distribution of graphs is equivalent to computing 
the marginal likelihoods. 

To illustrate how graphical models can be used as a regularization tool, 
we build an example where the underlying graph is unknown and will be 
selected based on marginal likelihoods of a particular restricted set of graphs. 
We focus on a subclass ^bgo of graphs where the precision matrix is "banded" 
(see [2]). The restricted subclass ^beo consists of the decomposable graphs 
Gk^k = 1, . . . , 60 with cliques Cj = {j, j + 1, . . . , j + k},j = 1, . . . ,r — k. The 
graphical Gaussian model Markov with respect to G^ can be viewed as an 
AR(k) model and the corresponding precision matrix is a banded matrix 
with a band of width k + 1, indicating that all elements beyond the kth. 
supra-diagonals are zero. For added simplicity, we use the HIW{3,I), a 
special case of the IWp^, as a prior for S. 

As in the previous example we build the true covariance matrix from 
the sample covariance of the call center data using a graph corresponding 
to k = 20 followed by the completion operation in (2.1). We proceed by 
sampling n observations from a Nr{0,T,), 1000 times. At each iteration, we 



Table 2 

Choosing the graph example: risk for estimators 









i?2(n) 


i?2(S) 






n= 100 


k = 4.36 




Reference 


15.36 (23.67) 


17.14 (19.85) 


1902.6 (13375.0) 


22.19 (241.2) 


HIW{3,I) 


14.76 (22.77) 


16.55 (19.11) 


1736.3 (10350.0) 


16.23 (56.88) 


MLEg 


15.89 (32.64) 


18.42 (21.08) 


1876.0 (9897.80) 


16.54 (57.74) 


MLE 


9.9 X 10*^ 


102.53 


1.1 X 10^** 


133.08 






n = 500 


k = 7.38 




Reference 


8.084 (11.94) 


9.078 (11.79) 


1105.6 (1500.6) 


5.648 (16.01) 


HIW{3,I) 


8.053 (11.96) 


8.961 (12.81) 


1101.7 (1234.5) 


5.070 (11.09) 


MLEg 


8.129 (12.20) 


9.297 (15.85) 


1116.8 (1571.2) 


5.088 (11.12) 


MLE 


14.45 


10.85 


3147.4 


26.43 






n = 1000 


fc = 18.80 




Reference 


2.256 (1.930) 


2.203 (1.893) 


345.5 (310.3) 


6.624 (7.009) 


HIW{3,I) 


2.259 (1.936) 


2.197 (1.899) 


350.4 (317.9) 


5.480 (5.736) 


MLEg 


2.311 (1.992) 


2.232 (1.910) 


331.8 (291.3) 


5.492 (5.747) 


MLE 


6.003 


5.226 


1006.8 


13.20 



The values in parentheses refer to the risk of estimators generated by the "oracle" (fc = 20). 
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compute the marginal likelihood for all graphs with k = 1, . . . ,60, choosing 
the top model to proceed with the estimation of S. For each estimator, under 
the different priors, losses are computed and risk compared. We also inves- 
tigate the performance of the procedure by comparing estimators generated 
through the "oracle" that knows the correct value of k. The corresponding 
risk values are given in brackets. 

We repeat this exercise for n = 100, 500 and 1000 with results presented 
in Table 2. In each case, the average of the best k^s is denoted k and is also 
given in the table. 

It is clear from the example that more parsimonious models are selected 
from small sample sizes. Indeed, for n = 100 we choose the average band 
size k = 4.36; for n = 500, we choose k = 7.38 and for n = 1000, k = 18.80. 
This highlights the Ockham's razor effect of marginal likelihoods [20], in 
selecting a graph. Moreover, for n = 100 and n = 500 the losses generated 
by the oracle are always larger than those of our estimators, with the oracle 
only being relatively competitive for n = 1000. 

7. Call center data. In this section, we apply our methodology to the 
call center data analyzed in [2, 18]. With this example, we will illustrate the 
predictive properties of our estimators and the flexibility yielded by graph- 
ical models. Indeed, we will show that our estimators when using banded 
matrices and the IW have a smaller predictive error than the mlcg. More 
strikingly, we will show that when using bands varying in width along the di- 
agonal together with the IW , our estimators yield significantly improved 
predictive power over the best uniformly banded model. 

The dataset in this example constitutes records from a call center of a ma- 
jor financial institution in 2002 where the number of incoming phone calls 
during 10-minute intervals from 7:00 am till midnight were recorded. Week- 
ends, public holidays and days with equipment malfunction are excluded, 
resulting in data for 239 days. The number of calls in each of these intervals 
is denoted as Nij, i = 1, 2, . . . , 239 and j = 1, 2, . . . , 102. A standard trans- 
formation Xij = {Nij + ^)^^^ is applied to the raw data to make it closer to 
Normal. 

7.1. Analysis with banded precision matrices. We consider the class of 
models Gbm as described in Section 6.2 and as we saw there, choosing a 
model Markov with respect to Gk € Gbm is equivalent to banding the inverse 
covariance matrix and our results can therefore be readily compared to those 
of Bickel and Levina [2]. It is important to note, however, that our approach 
differs from that of [2] in two ways. First, in [2] the banded estimators for 
the precision matrix are used as estimators of the saturated We, on the 
other hand, fit graphical models to the call center data, explicitly assuming 
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that the true S is such that is in Pq, that is, has fixed zeros. Second, 
we use the eight estimators for S as described in Section 5.1. This includes 
the traditional frequentist estimator for the graphical model, mleg, and the 
Bayesian estimators that we have developed above. 

We employ both the traditional cross-validation and Bayesian model selec- 
tion procedures to determine the "best" model among the class of graphical 
models with fe-banded precision matrices. The cross-validation procedure is 
done through the i^-fold cross-validation method with K = 10. The dataset 
with 239 data points is divided into 10 parts (the first nine parts have 40 
observations and the last has 39 observations). We predict the second half of 
the day, given data for the first half, on the test data after computing estima- 
tors on the training dataset. In particular, we partition the 102-dimensional 
random vectors Xi, X2, ■ ■ ■ , X23g into two equal parts, each representing the 
first and second half of the day as follows: 

(2) ' ^=L,,J' ^" 
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where xf^ = {xii,Xi2-, ■ ■ ■ -.Xi^if and xf^ = (xi52, Xjss, . . . , Xiio2)*- The mean 
vectors are partitioned in a similar manner. The best linear predictor for 
xf'^ from xf"^ is given by 



X. 



ll2 + ^-2\^\\\Xi - m) 



We use the prediction equation above with the following eight estimators for 
E: the mle and mleg, the estimators based on Li loss, , , S^^-^ , 

and the estimators based on L2 loss, S^^^'' , S^^'^. For the Bayes 
estimators we use the traditional choice of the identity matrix as the scale 
hyperparameter and the shape parameters for the IWp^ are set as Oi = 
— 5,Vz,/?2 = —4 + |. The prediction error on the test data is measured by 
the average absolute forecast error. The Bayesian model selection procedure 
entails choosing the model with the maximum marginal likelihood according 
to the principles given in Section 6.2. Here the prior distribution used for 
S € Qg is the IW with the same hyperparameters as above. 

For all eight estimators, the cross-validation procedure identifies k = 4 
as the model with the lowest prediction error for the second half of the day 
given the first. The Bayesian model selection procedure identifies fc = 5 as the 
model with highest marginal likelihood. We note that both model selection 
procedures yield very similar, parsimonious models for the call center data. 

We proceed to compare the forecast performance (or prediction error) of 
our estimators. As done in [2], we forecast the second half of the day based 
on the first half using the first 205 data points as the training set and the 
last 34 as the test set. The prediction error lines are given in Figure 2. 
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S 10 15 SO as 30 35 40 45 SO 

Tims Point 



Fig. 2. Forecast error for selected banded and ''differentially handed" models. 



The S^^^ and '^^i prediction error lines are so close that it is difficult 
to distinguish between the two in a plot and thus we only show the one for 

^'^ . Prediction errors using the Bayes estimators are all lower than the 

prediction error given by the mleg and by the standard sample covariance 

matrix, the mle. For the sake of clarity, Figure 2 gives the forecast error for 

the best model {AR{A)) chosen via cross-validation for the mle, the mleg, 
~ IWp 

T,j^^ ^ and another estimator, "Li/14^PgdifF' we will describe in Section 
7.2. 

Besides the overall poor performance of the standard estimator 5, it is 
also well-understood that S overestimates the largest eigenvalues and under- 
estimates the lowest eigenvalue. An examination of the scree plots implied 
by the eigenvalues of our Bayes estimators (not shown here for brevity) re- 
veals that our Bayes estimators compared to S have lower estimates of the 
largest eigenvalues and higher estimates for the smallest eigenvalues — hence 
the Bayes estimators seem to shrink the eigenvalues toward the center of the 
eigenspectrum, a property often sought by estimators proposed in previous 
work [9, 25, 35]. 

7.2. Analysis with differential banding. In the above analysis we restricted 
ourselves to the class of /c-banded inverse covariance models or AR{k) mod- 
els. This approach highlighted, among other important properties, the fact 
that banding the inverse covariance matrix, as carried out in Bickel and 
Levina [2], essentially entails fitting graphical models. 

We noted that the cross-validation error from predicting the second half 
of the day given the first half suggested that the AR(A) model (i.e., fc = 4) 
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gave the lowest prediction error. The cross-vahdation error from predicting 
the first half of the day given the second half suggested that the AR{16) 
model (i.e., k = 16) gives the lowest prediction error. These two different 
values of k suggests that one single k may not be sufficient to explain the 
features of this dataset. An examination of the sample correlation matrix 
suggests different correlation structures during the first and second half of 
the day. In particular the correlation between variables which are farther 
apart is stronger in the first half of the day than in the second half of the 
day, suggesting that the order of the lag, which represents the level of the 
strength between neighboring variables, could be different in different parts 
of the day. It seems that "differential banding" of the inverse covariance 
matrix is necessary to capture this effect and fitting a straight /c-banded 
model, as was done previously, may not necessarily allow the fiexibility that 
we want. 

A natural approach to obtaining this flexibility is to frame the problem 
once more in the context of graphical models. Let us now consider graphical 
models with two different clique sizes for the two parts of the day as an 
extension of the single clique size suggested by A;-banded models. We note 
that A;-banded models have cliques of size k + 1 and separators of size k. 
Let us consider what we term as {ki,k2,r) "differentially banded" models, 
where fci + 1 represents the size of the cliques in the first part of the day, 
A;2 + 1 the size of the cliques in the second part of the day, and r the point 
at which the change takes place — and in our case this point r will be the 
variable at which the last clique of size ki + 1 ends^ The next clique after 
this last clique of size ki + 1 will be the clique of size A:2 + 1 such that only 
variable r + 1 does not belong to the previous clique of size ki + 1. The 
cliques in the second part of the day will now cascade as before but will be 
of size k2 + 1. 

We keep the same hyperparameters {a,l3,9) as in Section 7.1 since they 
still satisfy the conditions in Section 3.1 for the given perfect order P of 
cliques. The same K-iold cross-validation procedure is used to select the 
{ki,k2,r) model with the lowest prediction error. We found that the {ki = 14, 
^2 = 4, r = 58) differentially banded model has the lowest prediction error 
when using the IW p^. priors. These results are consistent with those from the 
cross-validation in Section 7.1 and show that there is compelling evidence to 
suggest that there are different correlation structures during different parts 
of the day. For a concrete illustration of the benefits of differential banding, 
and for comparison purposes with [2] , we also considered the task of choosing 
the best model for predicting the second half of the day based on the first 
half, as done in Section 7.1. In this case, the model with the lowest prediction 



^Naturally one can also extend this concept to multiple banding. 
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error using the IW prior Bayes estimates under the Li loss turns out 
to be the {ki = 14, k2 = 1, r = 55) model. Other estimators also perform 
comparably but we omit the details here. The corresponding forecast error 
from using this model in comparison with the fc-banded models from before 
is given in Figure 2. Clearly, the "differentially banded" model gives us a 
substantial reduction in prediction error. Using the /c-banded model yields a 
23% reduction in prediction error over the sample covariance matrix S and 
the "differentially banded" model gives a 16% improvement over the best 
fc-banded model. Our new class of models gives better prediction than the 
A:-banded models for almost all time points. 

The results above highlight the performance of our estimators and their 
versatility in different settings. Moreover, and perhaps equally important is 
that taking a graphical models approach yields a much richer class of models 
than simple banding and achieves this flexibility in a very natural way in 
high-dimensional problems. 

8. Discussion. In this paper we considered the estimation of high-dimen- 
sional covariance and precision matrices in Gaussian graphical models us- 
ing a family of flexible, conjugate priors with multiple shape parameters. 
Existing Bayesian methods resort to either using the restrictive Diaconis- 
Ylvisaker conjugate prior with only one-dimensional shape parameter or 
using MCMC methods, both of which can be completely inadequate or at 
times even infeasible in very high-dimensional settings. Our objective in this 
paper was to overcome both of these obstacles and develop a comprehensive 
Bayesian solution to the problem at hand. 

We derived the form of the Bayes estimators under two commonly used 
loss functions, adapted to graphical models, for our flexible class of priors. 
Another important contribution of our work is the derivation of a nonin- 
formative reference prior for S and $7. Finally, we observe that our Bayes 
estimators have good frequentist risk properties and yield shrinkage in the 
eigenvalues. 

The unique set of properties of the approach proposed in this paper for 
the estimation of large covariance matrices makes it a viable and competitive 
methodology. Nevertheless, there is further scope to fully assess the proper- 
ties of our estimators. For instance, we would like to know more about their 
asymptotic properties when both r and n become large and we would also 
like to know more about the behavior of their eigenvalues. These and many 
other questions will be the subject of further work. 

APPENDIX 

A.l. Proof of Theorem 3.1. To compute the mean of the IWp^, that is, 
to prove Theorem 3.1, we first need to recall the definition of the normal 
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matrix variate and to prove the lemma below. Let us first recall the form of 
the matrix normal distribution as given in [31], page 79. Let X be an r x s 
random matrix. Then we say that X ~ N(m, ai (8) (T2) where cji is r x r and 
o"2 is s X s if its density is of the form 



Lemma A.l. Let X gW^^ follow a normal distribution Nrxsi'mjO'i ® 
CJ2) where ai is a positive definite s x s matrix and 02 is a positive definite 
r X r matrix. Let a = {ci'ij)i<i,j<s be a given fixed s x s matrix. Let 

he the r x r block in the ith block row and the jth block column of the 
rs X rs covariance matrix (o"i (72) where the rows are divided into s sets 
of r rows and so are the columns. Similarly, divide the [rs x rs matrix 
vecE(X){vecE{X)Y] matrix into s^ blocks 

{YecE{X){YecE{X))%, l<i,j<s, 

of size r X r. Then 

(A.2) E{XaX^)= aij((o-i ®f72)y + (vec(m)vec(m)*)ij) 

(A. 3) = mam* + tr((Ti a*) (T2- 

Proof. Let Xi,i = 1, . . . , s denote the columns of the r x s matrix X; 
then a straightforward calculation shows that XaX* = X]i<ij<s cnjXiX^^ and 
therefore 

E{XaX')= aijEiXiX]). 

l<jj<s 

Since by [31], if X ~ Nrxsi^, ci ^ o"2), then vec(X) ~ A^rs(vec(m), cJi (T2), 
it is clear that 

E(XiX]) = co-v{Xi,Xj) + E{Xi)E{XjY = {ai a2)ij + (vec(m) vec(?n)*)ij, 
E{XaX^)= Y aij((<Ti (g)a2)ij + (vec(m)vec(m)*)jj), 

l<i j<s 

which gives (A.2) and (A. 3) can be verified by inspection. This completes 
the proof of the lemma. □ 

Proof of Theorem 3.1. Using the distributional properties of vari- 
ous subblocks of an IW matrix as given in Theorem 4.4 of [27], we will 
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compute the expected value of the different entries of the matrix X. Because 
of the difference dependences, we shall proceed in the following order: 

E{X(2)) 

= E{X[12))E{X(^2)) = E{E{xii2)\x^iy))E{x^2)), 

Eixc,\S2) = E{X[i].) + E{X[12)X(2)X{21]) 

= E{X[1].) + E{X[12)E{X/^2))X(21]) 

= E{X[1].) + E{E{X[12)E{X/^2))X(21] \X[1].)), 

Eix[j)) = E{{x[j)xJ^pX(^j)) = E{{x[j)X^p)E{xQ'^), j = 2,...,k, 

Eix[j]) = Eix[j].) + E{{x[j)X^pXQ) {x^}^X(^j])) j = 2,...,k, 

= E{xu^.) + E{{x[j)X^^^)E{x{j)){x^^^X{j^)) 

= E{x[j^.) + E{E{{x[j)X^:\)E{x{j)){x^}^X{j^)\xy].)). 

Following the general formula for the expectation of an inverse Wishart 
distribution we have 

E{X^2))^ ^^'^ 



{ai + ((ci - S2)/2) + 72) - ((S2 + l)/2) 

0(2) 



-(ai + ((ci + l)/2)+72)" 



Next, 



E{xc,\S2,S2) = E{x,^i2)X^2)) = E{X[12))E{X{2)) 

(A.5) =%2> 



0(2) 



(ai + ((ci + l)/2) + 72) 

0Cl\S2,S2 



-(ai + ((ci + l)/2)+72)' 



Next, 



Eixc^\S2) = E{xii].) + E{X[12)X(2)X/^21]) 

= E{xii].) + E{X[12)E{X{2))X(21]) 

= Eix^iy) + E{E{x[i2)Eix(^2))X(21] \x[i].)) 

^ %L 

-(ai + ((ci-S2 + l)/2)) 
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(A.6) 



1 trg^jg(2) 
+ 2-(ai + ((ci + l)/2)+72)''W' 

0\l]. ^ %\52,52^(2>^52,Ci\52 



+ 



(ai + ((ci - S2 + l)/2)) -(ai + ((ci + l)/2) + 72) 

S20[l]. 

2(ai + ((ci + l)/2) + 72)(ai + ((ci - S2 + l)/2)) 



+ 



(ai + ((ci - S2 + l)/2)) V 2(ai + ((ci + l)/2) + 72) 

^Cl\S2,S2^~{'2)^S2,Cl\S2 

-(qi + ((ci + 1)/2) + 72) 



and 
(A.7) 



^(^[i>) =^((^[i>^(i))^(i>) =^((^[i>^(i>))^(^(i)) 

where E{x(^j^) is given by the previous calculations (note that we are com- 
puting E{X) layer by layer starting from the top). 

Now, using the same type of calculations as in the computation of 
we have, for j = 2, . . . , n, 

=£;(xy].) +-B(-E((xy)X^.J)£;(X(j))(2;^.Jx(j])|X[j].)) 



("j + ((ci - Sj + 



^ ^ ^ + -2(a, + ((c,-s, + l)/2))^yi- 



ail- 



l + itr(0^-:;ii;(x^,)))) 



_(a. + ((c.-s, + l)/2)) 

This completes the proof. □ 
A.2. Proof of Theorem 4.1. 

Proof. Following (4.1), we see that the log-likelihood for cj) is 

/ (0) = / (cr~^_ , O- [12) , 0-^" ) ' ' ^[i> ^(j) ' i = 2, • • • , ^) 
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(A.9) 



J=2 

fc 

k 

In order to obtain the information matrix H'^^cj)) = E{^-^^) we differen- 
tiate the log-likehhood twice. We can then see that the Fisher's information 
matrix is block diagonal with blocks H^{(j)),j = 1, . . . ,k + \ according to 

<Al=cr('^), 02 = (0-[j;pCr[i2)), = (crypCry)CJ^^j), j = 2,...,k. 

In fact, since E{x^i2)) = (T[i2) and ^'(xy^x^.p = ay^a'^jyj = 2,...,k, each 
Hj'{(j)),j = 2, . . . ,k + 1 is itself block diagonal and its determinant is 

fc+i 

detif^((/.) = []deti/f ((/.), 
1=1 

where 

det{Hf{cf>)) = \a(^,^r^\ 
(A.IO) detiH^m = \a[,yr'\EiXf^,^)r~''\a[^.r~''+\ 

for j = 2, . . . , k. 

In the general case where X ~ WQ^{a, f3,a) we would have to use the 
expression of E{X) as given in (4.21) of [27]. In the case that concerns us 
here, X is hyper Wishart and it is well known that E{X) oc a and therefore 

E{x{j))oca(^jy 

Thus, up to a constant independent of o", we have 

det(i/f(0)) = |a<2>r'+\ 

det{Htm = k[i].r^^ia<2>r^"^^k[i].r^-^^+\ 

det{Hf^,{^)) = ky].r^~''^+Vb1-r'^>0-)r^"'^' for j = 2,...,k, 
and therefore for / = 1,...,A; + 1, det Hf((j)) are of the form 
detiff = ai{(l)i)bi{(t)i,...,(t)i^i) 



(A.ll) 7t'I'{4>) = k(2)l^'^+'^/V[i].|"'^+'^/'^"'' nk[:'^ |(fe+i)/2)-=. 
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where ai and bi are functions from the parameter space to M''". More pre- 
cisely, we see that 

ai+i{4>l+i) = k[i].r'^^~^''S l = 2,...,k. 

According to the theory developed for natural exponential families by [10] 
and recalled in Section 2.3 of [7], the reference prior for (p, with the given 
order of its component, is 

fc 

i=2 

In fact, we see that the prior is independent of the order of these components 
and (4.3) is proved. 

We now want to derive the induced prior for a. Meticulous but relatively 
easy computations show that the Jacobian from (f) to is equal to 

k 

i=2 

Moreover, the Jacobian from to a is known (see [32]) and equal to 

k k 

(A.12) n(Ki-ii%-)i)"'^^"'n(i%-)i)^^^'^'^ 

where some of the separators may be identical. Therefore the Jacobian from 
(/) to o" is 

Therefore the induced prior for a is 

1^ |({s2+l)/2)-Ci-l-C2+S2l |((ci+l)/2)-S2+S2-Cl-l 
k k 

i=2 i=3 

^ I |-{(ci+C2)/2)-((ci+l)/2)-((c2+l)/2)+((s2+l)/2)+S2| |-((ci+l)/2) 
I (2) I I [IJ • I 

xnia,,r«^^-+^)/^)n(ia,,i)(^-^^) 

i=2 j=3 

|^^j-((c,+l)/2)j^^.^^|^^^|-((c,+l)/2) 
|(j^^|((ci+C2)/2)-S2-({s2+l)/2) |^^j((c,-s,)/2)-((s,-+l)/2) ' 
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and Theorem 4.1 is proved. □ 
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